In rdrr1990/kerasformula: A High-Level R Interface for Neural Nets

library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, comment="")
library(ggplot2)

AWS Movie Data with kerasformula

library(kerasformula)
movies <- read.csv("http://s3.amazonaws.com/dcwoods2717/movies.csv")
dplyr::glimpse(movies)

Classifying Genre

sort(table(movies$genre))

out <- kms(genre ~ . -director -title, movies, seed = 12345)

plot(out$history) + labs(title = "Classifying Genre", 
                         subtitle = "Source data: http://s3.amazonaws.com/dcwoods2717/movies.csv", y="") + theme_minimal()
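
Overall accuracy in a history plot hides how individual genres fare. The following is a minimal sketch of a per-class breakdown, using made-up labels rather than the movie data; the vectors `actual` and `predicted` are hypothetical stand-ins for held-out genres and the model's predictions.

```r
# Toy labels standing in for held-out genres and model predictions (made up).
actual    <- factor(c("Comedy", "Comedy", "Comedy", "Drama", "Drama", "Action", "Horror"))
predicted <- factor(c("Comedy", "Comedy", "Drama", "Drama", "Drama", "Comedy", "Comedy"),
                    levels = levels(actual))

# Rows are true classes, columns are predicted classes.
confusion <- table(actual = actual, predicted = predicted)

# Per-class recall: correct predictions divided by the number of true cases.
recall <- diag(confusion) / rowSums(confusion)

# Put class frequency next to recall to see whether rare classes fare worse.
per_class <- data.frame(genre  = names(recall),
                        n      = as.vector(rowSums(confusion)),
                        recall = as.vector(recall))
per_class[order(-per_class$n), ]
```

In this toy example the two frequent classes are mostly recovered while the two singletons are never predicted correctly, which is the kind of pattern to look for in the real output.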

The classifier does quite well for the top five categories but struggles with rarer ones. Does adding director help?

out <- kms(genre ~ . -title, movies, seed = 12345)

plot(out$history) + labs(title = "Classifying Genre", 
                         subtitle = "Source data: http://s3.amazonaws.com/dcwoods2717/movies.csv", y="") + theme_minimal()
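
A history plot like the one above can be checked for overfitting by comparing training and validation accuracy epoch by epoch. The sketch below uses made-up numbers; with a real fit the two vectors might be read from `out$history$metrics`, but the element names there are an assumption and vary across keras versions.

```r
# Made-up per-epoch accuracies (not results from the movie data).
train_acc <- c(0.35, 0.48, 0.60, 0.72, 0.83, 0.91)
val_acc   <- c(0.33, 0.44, 0.50, 0.52, 0.51, 0.49)

# Generalization gap per epoch: training accuracy minus validation accuracy.
gap <- train_acc - val_acc

# Epoch at which validation accuracy peaks; later epochs improve
# training accuracy without improving held-out accuracy.
best_epoch <- which.max(val_acc)

data.frame(epoch = seq_along(gap), train_acc, val_acc, gap)
```

A gap that keeps widening while validation accuracy stalls or falls is the usual signature of overfitting.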

Adding director doesn't hurt performance much, but it introduces overfitting. Keeping only the top 50 directors (and lumping the rest into "other") doesn't improve results much either, but it avoids the overfitting issue.

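# Keep the 50 most frequent directors and lump everyone else into "other".
top50 <- names(sort(table(movies$director), decreasing = TRUE))[1:50]
movies$top50_director <- as.character(movies$director)
movies$top50_director[!(movies$top50_director %in% top50)] <- "other"
# `.` now includes top50_director; the raw director column is still dropped.
out <- kms(genre ~ . -director -title, movies, seed = 12345)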

plot(out$history) + labs(title = "Classifying Genre", 
                         subtitle = "Source data: http://s3.amazonaws.com/dcwoods2717/movies.csv", y="") + theme_minimal()
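
When picking "top" directors, it matters whether names or frequencies are being ranked. Here is a small sketch of the difference on made-up data; the vector `director` and the cutoff of two are illustrative assumptions only.

```r
# Made-up director names; "Zed" is the most frequent but last alphabetically.
director <- c("Zed", "Zed", "Zed", "Abe", "Bob", "Bob", "Cy")

# rank() on a character vector orders the names alphabetically (ties averaged),
# so it says nothing about how often each director appears.
alphabetical_rank <- rank(director)

# Counting occurrences with table() ranks by frequency instead.
counts <- sort(table(director), decreasing = TRUE)
top2   <- names(counts)[1:2]

# Keep the two most frequent directors and lump the rest into "other".
lumped <- ifelse(director %in% top2, director, "other")
lumped
```

Lumping by frequency keeps the number of dummy variables small while retaining the directors with enough films to carry signal.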